Performance of a Large Language Model in Screening Citations

Key Points
Question: How accurate and efficient is a large language model (LLM) for screening titles and abstracts for article inclusion in a systematic review?
Findings: In this diagnostic study, LLM-assisted citation screening exhibited acceptable sensitivity and reasonably high specificity in evaluating 5 clinical questions, with post hoc prompt modifications further improving accuracy. The screening time for 100 studies was significantly reduced compared with that of conventional methods.
Meaning: These findings suggest that LLM-assisted citation screening could offer a reliable and time-efficient alternative to systematic review processes.


Clinical questions in the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock
Briefly, members of the J-SSCG 2024 working group conducted extensive literature searches for these clinical questions (CQs) using CENTRAL, PubMed, and Ichushi-Web (the Japanese biomedical literature database), with a comprehensive search strategy that included all key studies. The literature was restricted to articles written in Japanese and English. Subsequently, all titles and abstracts were downloaded, organized, and deduplicated using EndNote (Clarivate Analytics, Philadelphia, PA, USA) for citation management for J-SSCG 2024. 1

Conventional citation screening
EndNote-processed files were imported into Rayyan, 2 a web application designed to facilitate systematic reviews. Subsequently, two independent reviewers screened titles and abstracts, resolving any discrepancies either by consensus or by the judgment of a third reviewer. The literature selected through these conventional citation screening methods was used as the reference standard. The details of the conventional citation screening are described in the Supplementary File. Notably, the conventional screening results are independent of the large language model (LLM)-assisted citation screening process and were concealed from the LLM in the current analysis to preserve the integrity of the evaluation process. In addition, the authors of this study were not involved in the conventional literature search for the selected CQs. To obtain the time required for the conventional screening method, we used previously published data on the duration of the screening sessions recorded using Rayyan's built-in time-tracking feature. 2,3

Statistics on accuracy of the LLM-assisted citation screening in the secondary analysis
LLM-assisted citation screening selected 38 publications in CQ1 (0.67%), 9 in CQ2 (0.26%), 5 in CQ3 (0.48%), 37 in CQ4 (0.86%), and 27 in CQ5 (1.20%) in the title/abstract screening session. In the secondary analysis, the sensitivity and specificity with 95% CI obtained from the LLM-assisted screening were 0.34 [0.26-0.43] and 1.00 [1.00-1.00] for CQ1, 0.53 [0.30-0.74] and 1.00 [1.00-1.00] for CQ2, 0.36 [0.16-0.62] and 1.00 [0.99-1.00] for CQ3, 0.53 [0.41-0.64] and 0.99 [0.98-0.99] for CQ4, and 0.69 [0.53-0.82] and 0.99 [0.99-0.99] for CQ5, respectively (Figure 3).
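As an illustration of how such accuracy figures are derived, the following minimal Python sketch computes sensitivity and specificity from a 2×2 table of screening decisions against the reference standard. The source does not state which interval method was used, so the Wilson score interval is an assumption, as are the function names; the counts in the usage note are illustrative and are not taken from the study.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def screening_accuracy(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity and specificity, each returned as (estimate, lower, upper).

    tp/fn: reference-standard inclusions that the LLM included/excluded.
    tn/fp: reference-standard exclusions that the LLM excluded/included.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": (sensitivity, *wilson_interval(tp, tp + fn)),
        "specificity": (specificity, *wilson_interval(tn, tn + fp)),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, (est, lo, hi) in screening_accuracy(tp=8, fn=2, tn=90, fp=10).items():
        print(f"{name}: {est:.2f} [{lo:.2f}-{hi:.2f}]")
```

Because relevant citations are rare among thousands of screened records, the sensitivity interval rests on a small denominator and is correspondingly wide, whereas the specificity interval is narrow.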

Modified command
In the primary analyses, we observed three false-negative publications in CQ2, two in CQ3, and three in CQ4 (eTable 2 in the Supplement). When considering why the LLM opted for exclusion, we found that the LLM strictly followed the inclusion criteria according to the CQ framework (eTable 4 in the Supplement). Therefore, we modified the command prompt to make the inclusion more flexible to increase the sensitivity of LLM-assisted citation screening (eFigure 3 in the Supplement).

Impact of LLM-based methods on the outcomes of the meta-analyses
On April 17, 2024, only the results of the meta-analysis for CQ4 were publicly available. 4 Thus, we conducted a meta-analysis for CQ4, excluding studies identified as false negatives in LLM-assisted screening, and compared these findings with those obtained using conventional methods.

a The list of included studies for qualitative analysis using the conventional method was set as the standard reference.
b The list of included studies after the title/abstract screening using the conventional method was set as the standard reference.

The study focused on older critically ill patients (aged 65 years or older). Our systematic review and meta-analysis required the inclusion of adult patients aged 18 or older without an upper age limit specified. This study did not meet the criteria as it exclusively considered an older patient cohort. Moreover, it investigated permissive hypotension rather than targeting a higher mean arterial pressure compared with a lower one, which did not align with the intervention criteria for our review.

The study was designed as a crossover study, which is not the same as a simple randomized controlled trial (RCT). This design introduced an additional layer of complexity where each patient receives both the intervention and the control, which was not consistent with the simple parallel-group RCT structure required as per the eligibility criteria.

The effect of early goal lactate clearance rate on the outcome of septic shock patients with severe pneumonia
The study was described as a randomized perspective study, indicating a prospective study design. However, it was not explicitly stated as a randomized controlled trial (RCT), which is a specific requirement for inclusion in the review. The absence of clear confirmation that the study was a randomized controlled trial necessitated its exclusion based on the predefined eligibility criteria.

The content of the command prompt for the GPT-4 citation screening task included the patient/population/problem, intervention, comparison, and study design of clinical questions (CQ). This figure shows an example for CQ1.
You are conducting a systematic review and meta-analysis, focusing on a specific area of medical research. Your task is to evaluate research studies and determine whether they should be included in your review. To do this, each study must meet the following criteria:
Target Patients: Adult patients (18 years old or older) diagnosed with or suspected of having infection, bacteremia, or sepsis.
Intervention: The study investigates the effects of balanced crystalloid administration.
Comparison: The study compares the above intervention with 0.9% sodium chloride administration.
Study Design: The study must be a randomized controlled trial.
Additionally, any study protocol that meets these criteria should also be included.
However, you should exclude studies in the following cases: The study does not meet all of the above eligibility criteria. The study's design is not a randomized controlled trial. Examples of unacceptable designs include case reports, observational studies, systematic reviews, review articles, animal experiments, letters to editors, and textbooks. After reading the title and abstract of a study, you will decide whether to include or exclude it based on these criteria. Please answer with include or exclude only.
----------
You are conducting a systematic review and meta-analysis, focusing on a specific area of medical research. Your task is to evaluate research studies and determine whether they should be included in your review. To do this, each study must meet the following criteria:
Target Patients: The study includes adult patients diagnosed with or suspected of having infection, bacteremia, or sepsis. If there is a possibility that the study population includes patients with sepsis, the study should be included.
Intervention: The study investigates the effects of balanced crystalloid administration.
Comparison: The study compares the above intervention with 0.9% sodium chloride administration.
Study Design: The study must be a randomized controlled trial.
Additionally, any study protocol that meets these criteria should also be included.
However, you should exclude studies in the following cases: The study does not meet all of the above eligibility criteria. The study's design is not a randomized controlled trial. Examples of unacceptable designs include case reports, observational studies, systematic reviews, review articles, animal experiments, letters to editors, and textbooks. After reading the title and abstract of a study, you will decide whether to include or exclude it based on these criteria. If there is uncertainty in the decision due to a lack of adequate information as you evaluate each domain, you will answer include to minimize the possibility of inadvertently excluding potentially relevant literature. Please answer with include or exclude only.

Post hoc primary analysis adopted the original prompt and a majority vote strategy. The results of the included publications for qualitative analysis using the conventional method were used as the standard reference. In the upper panel, the individual

You are conducting a systematic review and meta-analysis, focusing on a specific area of medical research. Your task is to evaluate research studies and determine whether they should be included in your review. To do this, each study must meet the following criteria:
Target Patients: Adult patients (18 years old or older) diagnosed with or suspected of having infection, bacteremia, or sepsis.
Intervention: The study investigates the effects of balanced crystalloid administration.
Comparison: The study compares the above intervention with 0.9% sodium chloride administration.
Study Design: The study must be a randomized controlled trial.
Additionally, any study protocol that meets these criteria should also be included.
However, you should exclude studies in the following cases: The study does not meet all of the above eligibility criteria. The study's design is not a randomized controlled trial. Examples of unacceptable designs include case reports, observational studies, systematic reviews, review articles, animal experiments, letters to editors, and textbooks. After reading the title and abstract of a study, you will decide whether to include or exclude it based on these criteria. Let's think step by step. Please answer with include or exclude only.
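The prompts above share a fixed template that is filled with the criteria of each CQ, and the post hoc analyses refer to a majority-vote strategy. A minimal Python sketch of both ideas follows. It assumes a hypothetical `ask_llm` callable that sends text to a model and returns its reply; the abridged template, the number of votes (3), the tie-breaking rule, and the handling of unparseable replies are all assumptions and are not details reported in the source.

```python
from collections import Counter
from typing import Callable

# Abridged from the original prompt; the exclusion paragraph is omitted for brevity.
TEMPLATE = (
    "You are conducting a systematic review and meta-analysis, focusing on a "
    "specific area of medical research. Your task is to evaluate research "
    "studies and determine whether they should be included in your review. "
    "To do this, each study must meet the following criteria:\n"
    "Target Patients: {patients}\n"
    "Intervention: {intervention}\n"
    "Comparison: {comparison}\n"
    "Study Design: {design}\n"
    "Please answer with include or exclude only."
)


def build_prompt(patients: str, intervention: str, comparison: str, design: str) -> str:
    """Fill the template with the criteria of one clinical question."""
    return TEMPLATE.format(
        patients=patients, intervention=intervention,
        comparison=comparison, design=design,
    )


def parse_decision(reply: str) -> str:
    """Map a free-text reply to 'include' or 'exclude'.

    Assumption: anything that is not a clear 'exclude' is kept, to avoid
    dropping potentially relevant citations.
    """
    return "exclude" if reply.strip().lower().startswith("exclude") else "include"


def majority_vote(decisions: list[str]) -> str:
    """Return the most frequent decision; ties resolve to 'include' (assumption)."""
    counts = Counter(decisions)
    return "include" if counts["include"] >= counts["exclude"] else "exclude"


def screen_citation(prompt: str, title: str, abstract: str,
                    ask_llm: Callable[[str], str], n_votes: int = 3) -> str:
    """Query the model n_votes times for one citation and take the majority."""
    query = f"{prompt}\n\nTitle: {title}\nAbstract: {abstract}"
    votes = [parse_decision(ask_llm(query)) for _ in range(n_votes)]
    return majority_vote(votes)
```

With an odd number of votes no tie can occur; the tie rule only matters if `n_votes` is set to an even number.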

Model-Assisted Citation Screening
eFigure 1. Command Prompt for the LLM Citation Screening Task
eFigure 2. Comparison of Citation Screening Time for 100 Studies Between the Large Language Model-Assisted and Conventional Methods
eFigure 3. Modified Command Prompt for the LLM Citation Screening Task in the Post Hoc Analysis
eFigure 4. Post Hoc Analysis for the Secondary Analysis Using the Modified Prompt
eFigure 5. Post Hoc Analysis for the Primary Analysis Using the Original Prompt and a Majority-Vote Strategy
eFigure 6. Post Hoc Analysis for the Secondary Analysis Using the Original Prompt and a Majority-Vote Strategy
eFigure 7. Post Hoc Analysis for the Secondary Analysis Using the Modified Prompt and a Majority-Vote Strategy
eFigure 8. Modified Command Prompt Integrating the Chain-of-Thought Strategy for the LLM Citation Screening Task in the Post Hoc Analysis
eFigure 9. Post Hoc Analysis for the Primary Analysis Using the Original Prompt and the Chain-of-Thought Strategy
eFigure 10. Post Hoc Analysis for the Secondary Analysis Using the Original Prompt and the Chain-of-Thought Strategy
eFigure 11. Post Hoc Analysis for the Primary Analysis Using the Modified Prompt and the Chain-of-Thought Strategy
eFigure 12. [...] Prompt and the Chain-of-Thought Strategy
eFigure 13. Forest Plots of Pairwise Meta-Analyses for Short-Term Mortality
eFigure 14. Forest Plots of Pairwise Meta-Analyses for ICU Mortality
eFigure 15. Forest Plots of Pairwise Meta-Analyses for ICU Length of Stay
eFigure 16. Forest Plots of Pairwise Meta-Analyses for Ventilator-Free Days
eReferences.

This supplemental material has been provided by the authors to give readers additional information about their work.

eAppendix Table of Contents
1. Clinical questions in the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock
2. Conventional citation screening
3. Statistics on accuracy of the LLM-assisted citation screening in the secondary analysis
4. Overall citation screening time for 100 studies between the LLM-assisted and conventional methods
5. Modified command based on the false negative studies
6. Statistics on accuracy of the LLM-assisted citation screening in the post hoc analysis
7. Impact of LLM-based methods on the outcomes of the meta-analyses

Overall citation screening time for 100 studies between the LLM-assisted and conventional methods
Using the conventional citation screening approach, the median [minimum, maximum] time elapsed to screen 100 studies for each CQ during the title/abstract screening was 14.7 [13.1, 18.7] min for CQ1, 11.9 [10.0, 31.1] min for CQ2, 16.3 [7.9, 31.2] min for CQ3, 15.8 [13.1, 27.7] min for CQ4, and 11.7 [10.5, 17.1] min for CQ5. The processing time per study for each CQ in the LLM-assisted citation screening was 1.3 min for CQ1, 1.3 min for CQ2, 1.3 min for CQ3, 1.3 min for CQ4, and 1.3 min for CQ5 (eTable 3 in the Supplement).

eTable 1. List of the Patient/Population/Problem, Intervention, and Comparison of the Selected Clinical Questions
CQ: clinical question; ICU: intensive care unit

[...] of the study was the incidence, severity, and mortality of multiple organ dysfunction syndrome (MODS), not sepsis or septic shock specifically. Although MODS could be a consequence of severe sepsis and patients with severe sepsis might have been included in the study, the criteria for inclusion required the study to focus on adult patients with sepsis or septic shock. Additionally, it was not clear if the intervention included tissue perfusion parameters mentioned in the criteria (lactate clearance, capillary refill time, ScvO2/SvO2, and P(v-a)CO2/C(a-v)O2) as part of the EGDT protocol. The abstract mentioned blood lactate concentration but did not specify whether the EGDT protocol adhered to the specific perfusion parameters criteria required for inclusion in the review.

[...] critically ill adult patients and investigated the effects of hemodynamic therapy on outcomes but did not specifically state that the patient population had sepsis or septic shock. The intervention also did not mention the tissue [...] in the inclusion criteria (lactate/lactate clearance, capillary refill time, ScvO2/SvO2, P(v-a)CO2/C(a-v)O2). Instead, the study focused on achieving supranormal levels of the cardiac index or normal levels of mixed venous oxygen saturation, which were not among the specified parameters. Therefore, the study did not meet the eligibility criteria for the systematic review and meta-analysis.

LLM: large language model, CQ: clinical question
a Included studies using a modified command prompt for the post hoc study

© 2024 Oami T et al. JAMA Network Open.

eFigure 2. Comparison of Citation Screening Time for 100 Studies Between the Large Language Model-Assisted and Conventional Methods
The difference in processing time was −15.25 min (95% confidence interval, −17.70 to −12.79; p < 0.001). An unpaired t-test was used for the analysis.

eFigure 3. Modified Command Prompt for the LLM Citation Screening Task in the Post Hoc Analysis
The modified command prompt for each clinical question (CQ) includes additional descriptions to the original version of the command prompt, highlighted in red. This figure shows an example for CQ1.

eFigure 4. Post Hoc Analysis for the Secondary Analysis Using the Modified Prompt
Post hoc secondary analysis adopted the modified prompt based on the false negative studies. The results of the included publications for the full text screening session using the conventional method were used as the standard reference. In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).

eFigure 5. Post Hoc Analysis for the Primary Analysis Using the Original Prompt and a Majority-Vote Strategy
In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).

eFigure 6. Post Hoc Analysis for the Secondary Analysis Using the Original Prompt and a Majority-Vote Strategy
Post hoc secondary analysis adopted the original prompt and a majority vote strategy. The results of the included publications for the full text screening session using the conventional method were used as the standard reference. In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).

eFigure 8. Modified Command Prompt Integrating the Chain-of-Thought Strategy for the LLM Citation Screening Task in the Post Hoc Analysis

eFigure 9. Post Hoc Analysis for the Primary Analysis Using the Original Prompt and the Chain-of-Thought Strategy
Post hoc secondary analysis adopted the original prompt and the chain of thought strategy. The results of the included publications for the full text screening session using the conventional method were used as the standard reference. In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).
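The figure captions report integrated sensitivities and specificities across CQ1-5 together with inconsistency values (I²). The source does not describe the pooling model, so the following Python sketch is only one plausible reading: it assumes fixed-effect inverse-variance pooling on the logit scale, a 0.5 continuity correction for boundary proportions, and I² derived from Cochran's Q. The function names and the example counts are hypothetical.

```python
from math import exp, log, sqrt


def logit_and_variance(events: float, total: float) -> tuple[float, float]:
    """Logit of a proportion and its approximate variance."""
    # Assumption: add a 0.5 continuity correction when the proportion is 0 or 1.
    if events == 0 or events == total:
        events, total = events + 0.5, total + 1.0
    p = events / total
    return log(p / (1 - p)), 1 / events + 1 / (total - events)


def inverse_logit(x: float) -> float:
    return 1 / (1 + exp(-x))


def pool_proportions(counts: list[tuple[int, int]]) -> dict:
    """Pool (events, total) pairs, e.g. (TP, TP + FN) per CQ for sensitivity.

    Returns the pooled proportion, an approximate 95% CI, and I² in percent.
    """
    effects = [logit_and_variance(e, n) for e, n in counts]
    weights = [1 / var for _, var in effects]
    total_weight = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, effects)) / total_weight
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, effects))
    df = len(counts) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    se = sqrt(1 / total_weight)
    return {
        "pooled": inverse_logit(pooled),
        "ci": (inverse_logit(pooled - 1.96 * se), inverse_logit(pooled + 1.96 * se)),
        "i_squared": i_squared,
    }


if __name__ == "__main__":
    # Hypothetical (true positives, reference inclusions) for five questions.
    print(pool_proportions([(8, 10), (7, 10), (9, 12), (15, 20), (6, 9)]))
```

A random-effects model would widen the interval when I² is large; the fixed-effect form is used here only to keep the arithmetic visible.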

eFigure 10. Post Hoc Analysis for the Secondary Analysis Using the Original Prompt and the Chain-of-Thought Strategy
Post hoc primary analysis adopted the original prompt and the chain of thought strategy. The results of the included publications for qualitative analysis using the conventional method were used as the standard reference. In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).

eFigure 11. Post Hoc Analysis for the Primary Analysis Using the Modified Prompt and the Chain-of-Thought Strategy
Post hoc primary analysis adopted the modified prompt and the chain of thought strategy. The results of the included publications for qualitative analysis using the conventional method were used as the standard reference. In the upper panel, the individual sensitivity for each clinical question (CQ) and integrated sensitivities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²). In the lower panel, the individual specificity for each CQ and integrated specificities across CQ1-5 are shown, with confidence intervals and inconsistency values (I²).

eFigure 13. Forest Plots of Pairwise Meta-Analyses for Short-Term Mortality
ScvO2 vs Usual care (full studies)
ScvO2 vs Usual care (two studies excluded [Chen 2007, Gattinoni 1995])
Lactate vs ScvO2 (full studies)
Lactate vs ScvO2 (one study excluded [Tian 2012])
No excluded studies for the comparison between CRT vs Lactate and P(v-a)CO2/C(a-v)O2 vs ScvO2.

eFigure 14. Forest Plots of Pairwise Meta-Analyses for ICU Mortality
ScvO2 vs Usual care (full studies)
ScvO2 vs Usual care (two studies excluded [Chen 2007, Gattinoni 1995])
No excluded studies for the comparison between Lactate vs Usual care.

eFigure 15. Forest Plots of Pairwise Meta-Analyses for ICU Length of Stay
ScvO2 vs Usual care (full studies)
ScvO2 vs Usual care (one study excluded [Gattinoni 1995])
Lactate vs ScvO2 (full studies)
Lactate vs ScvO2 (one study excluded [Tian 2012])
No excluded studies for the comparison between Lactate vs Usual care, CRT vs Lactate, and P(v-a)CO2/C(a-v)O2 vs ScvO2.

eFigure 16. Forest Plots of Pairwise Meta-Analyses for Ventilator-Free Days
ScvO2 vs Usual care (full studies)
Lactate vs ScvO2 (full studies)
CRT vs Lactate (full studies)
P(v-a)CO2/C(a-v)O2 vs ScvO2
No excluded studies for the comparison between ScvO2 vs Usual care, Lactate vs ScvO2, CRT vs Lactate, and P(v-a)CO2/C(a-v)O2 vs ScvO2.

eTable 2. Statistics on the Accuracy of Large Language Model-Assisted Citation [...]
CQ, clinical question; FN, false negative; FP, false positive; TN, true negative; TP, true positive

eTable 5. Post Hoc Analysis for Evaluating the Accuracy of Large Language [...]
a The list of included studies for qualitative analysis using the conventional method was set as the standard reference.
b The list of included studies after the title/abstract screening using the conventional method was set as the standard reference.